2.1 A Structural Model
For the Johnson & Johnson data, say \(y_t\), let \(x_t = \log(y_t)\). In this problem, we are going to fit a special type of structural model, \(x_t = T_t + S_t + N_t\), where \(T_t\) is a trend component, \(S_t\) is a seasonal component, and \(N_t\) is noise. In our case, time \(t\) is in quarters (1960.00, 1960.25, . . . ) so one unit of time is a year.
library(astsa)   # for the Johnson & Johnson quarterly earnings series, jj
y_t <- jj
x_t <- log(y_t)
Q <- factor(cycle(x_t))
lm.out <- lm(x_t ~ time(x_t) + Q + 0,na.action = NULL)
sum1 <- summary(lm.out)
sum1
##
## Call:
## lm(formula = x_t ~ time(x_t) + Q + 0, na.action = NULL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29318 -0.09062 -0.01180 0.08460 0.27644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## time(x_t) 1.672e-01 2.259e-03 74.00 <2e-16 ***
## Q1 -3.283e+02 4.451e+00 -73.76 <2e-16 ***
## Q2 -3.282e+02 4.451e+00 -73.75 <2e-16 ***
## Q3 -3.282e+02 4.452e+00 -73.72 <2e-16 ***
## Q4 -3.284e+02 4.452e+00 -73.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1254 on 79 degrees of freedom
## Multiple R-squared: 0.9935, Adjusted R-squared: 0.9931
## F-statistic: 2407 on 5 and 79 DF, p-value: < 2.2e-16
cat("The average annual increase in the logged earnings per share is ",round(coef(sum1)[1,1],2),sep='')
## The average annual increase in the logged earnings per share is 0.17
In any given year, logged earnings for the fourth and third quarters are:
\[ x_{t,4} = \beta t_4 + \alpha_4 + w_t, \qquad x_{t,3} = \beta t_3 + \alpha_3 + w_t \] Ignoring the small trend increment \(\beta(t_4 - t_3) = \beta/4\), the difference between these is: \[ x_{t,4}-x_{t,3} \approx \alpha_4-\alpha_3 \]
cat("From the third to the fourth quarter, the average logged earnings per share ",ifelse(coef(sum1)[5,1] > coef(sum1)[4,1],'increases','decreases')," by ",round(abs(coef(sum1)[5,1]-coef(sum1)[4,1]),3),sep='')
## From the third to the fourth quarter, the average logged earnings per share decreases by 0.269
There is perfect collinearity between the dummy variables and the intercept, so there is no unique OLS solution; R handles this by dropping one of the dummies. The resulting equation treats quarter 1 as the reference quarter. (Note also that the reported \(R^2\) values of the two fits are not comparable: for a no-intercept model, R computes \(R^2\) about zero rather than about \(\bar{x}\).)
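The rank deficiency can be verified directly. A minimal standalone sketch (using a toy quarterly factor rather than the fitted model above):

```r
# The four quarterly dummies sum to the all-ones intercept column, so a design
# matrix containing both is rank deficient and OLS has no unique solution.
Qtoy <- factor(rep(1:4, times = 5))                  # toy quarter labels
X <- cbind(intercept = 1, model.matrix(~ Qtoy + 0))  # intercept plus all four dummies
qr(X)$rank                                           # 4, not 5: one column is redundant
all(rowSums(model.matrix(~ Qtoy + 0)) == 1)          # dummies sum to the intercept column
```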
lm.out2 <- lm(x_t ~ time(x_t) + Q ,na.action = NULL)
sum2 <- summary(lm.out2)
sum2
##
## Call:
## lm(formula = x_t ~ time(x_t) + Q, na.action = NULL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29318 -0.09062 -0.01180 0.08460 0.27644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.283e+02 4.451e+00 -73.761 < 2e-16 ***
## time(x_t) 1.672e-01 2.259e-03 73.999 < 2e-16 ***
## Q2 2.812e-02 3.870e-02 0.727 0.4695
## Q3 9.823e-02 3.871e-02 2.538 0.0131 *
## Q4 -1.705e-01 3.873e-02 -4.403 3.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1254 on 79 degrees of freedom
## Multiple R-squared: 0.9859, Adjusted R-squared: 0.9852
## F-statistic: 1379 on 4 and 79 DF, p-value: < 2.2e-16
First we plot the original series with the fitted values, then the residuals against the fitted values:
df <- data.frame(time = time(x_t),fitted=predict(lm.out),resid=lm.out$residuals)
p <- autoplot.zoo(x_t)
p + geom_line(aes(x=time,y=fitted),data=df,color='red')
ggplot(aes(x=fitted,y=resid),data=df) + geom_point() + geom_smooth(se=F)
The model does not appear to fit the data very well, particularly at larger fitted values, and the residuals do not appear to be white noise.
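A formal complement to the eyeball check would be a Ljung–Box test on the residuals; in practice one would pass `lm.out$residuals` to `Box.test`. A hedged standalone sketch on simulated series:

```r
# Box.test with type="Ljung-Box" tests the null that a series is white noise.
set.seed(7)
ar_res <- arima.sim(list(ar = 0.7), n = 200)  # clearly autocorrelated "residuals"
wn_res <- rnorm(200)                          # genuine white noise
Box.test(ar_res, lag = 10, type = "Ljung-Box")$p.value  # essentially zero: reject
Box.test(wn_res, lag = 10, type = "Ljung-Box")$p.value  # typically large: no rejection
```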
2.2 For the mortality data examined in Example 2.2:
library(dynlm)                 # for lagged regressors via L()
temp = tempr - mean(tempr)     # center temperature
tempsq = temp^2
lm.out1 <- dynlm(cmort ~ time(cmort) + temp + tempsq + part)
lm.out2 <- dynlm(cmort ~ time(cmort) + temp + tempsq + part + L(part,4))
sum1 <- summary(lm.out1)
sum2 <- summary(lm.out2)
print(sum1)
##
## Time series regression with "ts" data:
## Start = 1970(1), End = 1979(40)
##
## Call:
## dynlm(formula = cmort ~ time(cmort) + temp + tempsq + part)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.0760 -4.2153 -0.4878 3.7435 29.2448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.831e+03 1.996e+02 14.19 < 2e-16 ***
## time(cmort) -1.396e+00 1.010e-01 -13.82 < 2e-16 ***
## temp -4.725e-01 3.162e-02 -14.94 < 2e-16 ***
## tempsq 2.259e-02 2.827e-03 7.99 9.26e-15 ***
## part 2.554e-01 1.886e-02 13.54 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.385 on 503 degrees of freedom
## Multiple R-squared: 0.5954, Adjusted R-squared: 0.5922
## F-statistic: 185 on 4 and 503 DF, p-value: < 2.2e-16
print(sum2)
##
## Time series regression with "ts" data:
## Start = 1970(5), End = 1979(40)
##
## Call:
## dynlm(formula = cmort ~ time(cmort) + temp + tempsq + part +
## L(part, 4))
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.228 -4.314 -0.614 3.713 27.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.808e+03 1.989e+02 14.123 < 2e-16 ***
## time(cmort) -1.385e+00 1.006e-01 -13.765 < 2e-16 ***
## temp -4.058e-01 3.528e-02 -11.503 < 2e-16 ***
## tempsq 2.155e-02 2.803e-03 7.688 8.02e-14 ***
## part 2.029e-01 2.266e-02 8.954 < 2e-16 ***
## L(part, 4) 1.030e-01 2.485e-02 4.147 3.96e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.287 on 498 degrees of freedom
## Multiple R-squared: 0.608, Adjusted R-squared: 0.6041
## F-statistic: 154.5 on 5 and 498 DF, p-value: < 2.2e-16
print(AIC(lm.out1))
## [1] 3332.282
print(AIC(lm.out2))
## [1] 3291.52
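For reference, R's `AIC` for a Gaussian linear model equals \(n\log(2\pi\,\mathrm{RSS}/n) + n + 2(p+1)\), where \(p\) is the number of regression coefficients and the extra 1 counts the error variance. A quick standalone check on toy data:

```r
# Reproduce AIC(fit) from the residual sum of squares of an lm fit.
set.seed(1)
x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
n <- length(y); p <- length(coef(fit)); rss <- sum(resid(fit)^2)
manual <- n * log(2 * pi * rss / n) + n + 2 * (p + 1)
all.equal(manual, AIC(fit))  # TRUE
```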
Adding the lagged particulate count significantly reduces the AIC, leading to the selection of the second model. Looking at plots of the fitted values against the residuals:
df1 <- data.frame(time=time(cmort),fitted1=predict(lm.out1),resid1=lm.out1$residuals)
df2 <- data.frame(fitted2=predict(lm.out2),resid2=lm.out2$residuals)  # time(lm.out2) is not valid; the time index is not needed for these plots
p <- ggplot(aes(x=fitted1,y=resid1),data=df1)
p + geom_point() + geom_smooth(se=F)
p %+% df2 + aes(x=fitted2,y=resid2) + geom_point() + geom_smooth(se=F)
The plots give similar results, showing a good fit except for points at larger fitted values.
df3 <- ts.intersect(cmort,stats::lag(part,k=-4),temp,part)  # k = -4 gives the lagged series P_{t-4}
df3 %>% pairs()
The graphs appear to show a positive relationship between \(P_{t-4}\) and \(M_t\), as well as between \(P_t\) and \(M_t\).
cor(df3)
## cmort stats::lag(part, k = 4) temp
## cmort 1.0000000 0.13975806 -0.44035520
## stats::lag(part, k = 4) 0.1397581 1.00000000 -0.06893067
## temp -0.4403552 -0.06893067 1.00000000
## part 0.4501258 0.53405047 -0.01686434
## part
## cmort 0.45012582
## stats::lag(part, k = 4) 0.53405047
## temp -0.01686434
## part 1.00000000
2.3 In this problem, we explore the difference between a random walk and a trend stationary process.
delta <- 0.01
create_four_drift <- function(x){
w = cumsum(rnorm(100))        # random walk
return(ts(delta*1:100 + w))   # add the drift delta*t
}
series <- lapply(1:4,create_four_drift)
create_four_plots1 <- function(x){
p <- autoplot.zoo(x)
return(p + geom_abline(slope=0.01) + geom_smooth(method='lm',formula=y~x-1,se=F))
}
plots <- lapply(series,create_four_plots1)
marrangeGrob(plots,ncol=2,nrow=2)
create_four_lin <- function(x){
w = rnorm(100)                # white noise
return(ts(delta*1:100 + w))   # linear trend plus noise
}
series2 <- lapply(1:4,create_four_lin)
create_four_plots2 <- function(x){
p <- autoplot.zoo(x)
return(p + geom_abline(slope=0.01) + geom_smooth(method='lm',se=F))
}
plots2 <- lapply(series2,create_four_plots2)
marrangeGrob(plots2,ncol=2,nrow=2)
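The qualitative difference between the two cases is that, net of the deterministic drift/trend, the random walk's variance grows linearly in \(t\) while the trend-stationary noise has constant variance. A standalone Monte Carlo sketch:

```r
# Variance at t = 100, net of the deterministic component.
set.seed(42)
nsim <- 2000; n <- 100
rw_end  <- replicate(nsim, sum(rnorm(n)))  # random walk minus drift at time n
lin_end <- replicate(nsim, rnorm(1))       # trend-stationary noise at time n
var(rw_end)   # close to n = 100
var(lin_end)  # close to 1
```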
2.8 The glacial varve record plotted in Figure 2.7 exhibits some nonstationarity that can be improved by transforming to logarithms and some additional nonstationarity that can be corrected by differencing the logarithms.
data("varve")
x_t <- varve
first <- 1:(length(x_t)/2)
first_xt <- x_t[first]
second_xt <- x_t[-first]
print(var(first_xt))
## [1] 133.4574
print(var(second_xt))
## [1] 594.4904
cat("The variance in the first half is ",round(var(first_xt)/var(second_xt),3)," times the size of the variance in the second half\n",sep='')
## The variance in the first half is 0.224 times the size of the variance in the second half
y_t <- log(x_t)
first_yt <- y_t[first]
second_yt <- y_t[-first]
print(var(first_yt))
## [1] 0.2707217
print(var(second_yt))
## [1] 0.451371
cat("The variance in the first half is ",round(var(first_yt)/var(second_yt),3)," times the size of the variance in the second half\n",sep='')
## The variance in the first half is 0.6 times the size of the variance in the second half
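Why the log helps: if the standard deviation of \(x_t\) grows proportionally with its level, the delta method gives \(\mathrm{sd}(\log x_t) \approx \mathrm{sd}(x_t)/E(x_t)\), which is constant. A standalone simulation sketch:

```r
# sd proportional to the mean: raw sds differ tenfold, logged sds match.
set.seed(3)
level <- rep(c(10, 100), each = 5000)
x <- rnorm(10000, mean = level, sd = 0.1 * level)
tapply(x, level, sd)       # roughly 1 and 10
tapply(log(x), level, sd)  # both roughly 0.1
```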
df <- data.frame(xt = as.numeric(varve),logxt = log(as.numeric(varve)))
p <- df %>% ggplot(aes(xt))
p + geom_histogram()
p %+% aes(logxt) + geom_histogram()
autoplot.zoo(y_t)
There appears to be cyclical behavior, with a cycle roughly every 500 years.
ggAcf(y_t)
ut <- diff(y_t)  # first difference: u_t = y_t - y_{t-1} (note stats::lag(y_t,k=1) would give y_{t+1})
ggAcf(ut)
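The differenced logs \(u_t = \log x_t - \log x_{t-1}\) are, to first order, the period-over-period proportional changes, since \(\log(1+r) \approx r\) for small \(r\). A quick standalone check:

```r
# log(1 + r) is close to r for small proportional changes r.
r <- 0.03
log(1 + r)        # about 0.0296, close to 0.03
x <- c(100, 103)  # a 3% increase
diff(log(x))      # the same quantity, computed from consecutive observations
```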
2.9 In this problem, we will explore the periodic nature of \(S_t\) , the SOI series displayed in Figure 1.5.
data(soi)
lm.out <- lm(soi ~ time(soi),na.action = NULL)
detrend <- lm.out$residuals
summary(lm.out)
##
## Call:
## lm(formula = soi ~ time(soi), na.action = NULL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.04140 -0.24183 0.01935 0.27727 0.83866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.70367 3.18873 4.298 2.12e-05 ***
## time(soi) -0.00692 0.00162 -4.272 2.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3756 on 451 degrees of freedom
## Multiple R-squared: 0.0389, Adjusted R-squared: 0.03677
## F-statistic: 18.25 on 1 and 451 DF, p-value: 2.359e-05
p1 <- autoplot.zoo(detrend)
p2 <- autoplot.zoo(soi)
grid.arrange(p1,p2)
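To make the periodicity explicit, a natural next step (not shown above) is a periodogram via base R's `spec.pgram`; for the detrended SOI one would expect peaks near one cycle per year and near the slower El Niño cycle of roughly four years. A standalone sketch on a toy monthly sinusoid:

```r
# The periodogram of a noisy sinusoid peaks at the sinusoid's frequency.
set.seed(5)
n <- 480                                                 # 40 "years" of monthly data
toy <- ts(sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 0.3), frequency = 12)
sp <- spec.pgram(toy, taper = 0, log = "no", plot = FALSE)
sp$freq[which.max(sp$spec)]                              # 1, i.e. one cycle per year
```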
2.10 Consider the two weekly time series oil and gas. The oil series is in dollars per barrel, while the gas series is in cents per gallon.
df1 <- data.frame(time=time(oil),var=oil/42)                 # dollars/barrel -> dollars/gallon (42 gallons per barrel)
df2 <- data.frame(time=time(astsa::gas),var=astsa::gas/100)  # cents/gallon -> dollars/gallon
ggplot() + geom_line(data=df1,aes(x=time,y=var,color='Oil')) + geom_line(data=df2,aes(x=time,y=var,color='Gas'))
These series look like they could be random walks with upward drift, which are not stationary series.
The series \(y_t\) is: \[ y_t = \log(x_t)-\log(x_{t-1}) = \log\left(\dfrac{x_t}{x_{t-1}}\right) \]
goil <- diff(log(oil))
ggas <- diff(log(astsa::gas))
df1 <- data.frame(time=time(goil),var=goil)
df2 <- data.frame(time=time(ggas),var=ggas)
ggplot() + geom_line(data=df1,aes(x=time,y=var,color='Oil Growth')) + geom_line(data=df2,aes(x=time,y=var,color='Gas Growth'))
The sample ACFs are given by:
acf(goil)
acf(ggas)
cross_corr = ccf(ggas,goil)
multiple_lag <- function(i){
oillag <- stats::lag(oil,k=-i)   # oil price i weeks earlier
goil.lag <- diff(log(oillag))    # lagged oil growth rate
df <- ts.intersect(goil.lag=goil.lag,ggas=ggas,time=time(goil.lag),dframe=TRUE)
return(ggplot(df,aes(y=ggas)) + geom_point(aes(x=goil.lag)))
}
l <- lapply(1:3,multiple_lag)
marrangeGrob(l,ncol=3,nrow=1)
There have been a number of studies questioning whether gasoline prices respond more quickly when oil prices are rising than when oil prices are falling (“asymmetry”). We will attempt to explore this question here with simple lagged regression; we will ignore some obvious problems such as outliers and autocorrelated errors, so this will not be a definitive analysis. Let \(G_t\) and \(O_t\) denote the gas and oil growth rates.
Fit the regression (and comment on the results) \[ G_t = \alpha_1 + \alpha_2 I_t + \beta_1 O_t + \beta_2 O_{t-1} + w_t \] where \(I_t = 1\) if \(O_t \geq 0\) and 0 otherwise (\(I_t\) is the indicator of no growth or positive growth in oil price).
# Here goil and ggas are simple differences (price changes), not the log
# differences used earlier; I(goil > 0) plays the role of the indicator I_t.
goil <- diff(oil)
ggas <- diff(astsa::gas)
lm.out <- dynlm(ggas ~ goil + stats::lag(goil,-1) + I(goil > 0))
summary(lm.out)
##
## Time series regression with "ts" data:
## Start = 2000(3), End = 2010(25)
##
## Call:
## dynlm(formula = ggas ~ goil + stats::lag(goil, -1) + I(goil >
## 0))
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.387 -2.601 -0.061 2.939 79.107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2089 0.5304 -2.279 0.02304 *
## goil 1.7081 0.1585 10.779 < 2e-16 ***
## stats::lag(goil, -1) 0.2076 0.1147 1.810 0.07083 .
## I(goil > 0)TRUE 2.3204 0.8210 2.826 0.00488 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.786 on 539 degrees of freedom
## Multiple R-squared: 0.3915, Adjusted R-squared: 0.3881
## F-statistic: 115.6 on 3 and 539 DF, p-value: < 2.2e-16
coef(summary(lm.out))[,1]
## (Intercept) goil stats::lag(goil, -1)
## -1.208896 1.708090 0.207631
## I(goil > 0)TRUE
## 2.320438
When the oil growth rate is nonnegative (\(I_t = 1\)), the fitted model is:
\[ G_t = -1.21 + 2.32 + 1.71O_{t} + 0.21O_{t-1} = 1.11 + 1.71O_{t} + 0.21O_{t-1} \]
When the oil growth rate is negative (\(I_t = 0\)):
\[ G_t = -1.21 + 1.71O_{t} + 0.21O_{t-1} \]
Yes, there is evidence for the asymmetry hypothesis. The growth rate of gasoline depends not only on the level of the oil growth rate but also on its sign: a positive growth rate boosts gasoline prices beyond what would be predicted by the oil growth rate alone.
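To see the asymmetry numerically, plug the rounded coefficients quoted above into the fitted equation (a purely illustrative sketch; `pred` is a helper defined here, not part of the original analysis):

```r
# Predicted gas price change for a unit oil move up versus down (lagged term at 0).
a1 <- -1.21; a2 <- 2.32; b1 <- 1.71; b2 <- 0.21
pred <- function(o_t, o_lag) a1 + a2 * (o_t >= 0) + b1 * o_t + b2 * o_lag
pred( 1, 0)  # oil up by one unit:    2.82
pred(-1, 0)  # oil down by one unit: -2.92
# Not mirror images: the indicator jump a2 applies only on the upside.
```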
df <- data.frame(resid= lm.out$residuals, fitted = predict(lm.out))
ggplot(df,aes(x=fitted,y=resid)) + geom_point() + geom_smooth(se=F)
The fit is fairly good, except for some outliers.
2.11 Use two different smoothing techniques described in Section 2.3 to estimate the trend in the global temperature series globtemp. Comment.
kern <- ksmooth(time(globtemp),globtemp,'normal',bandwidth=4)$y
sp <- smooth.spline(time(globtemp),globtemp,spar=0.6)$y
df <- data.frame(time=time(globtemp),kernel = kern, spline = sp, temp=globtemp)
ggplot(df,aes(x=time)) + geom_line(aes(y=temp)) + geom_line(aes(y=kern,color='Kernel'),lineend='round') +
geom_line(aes(y=sp,color='Smoothing Splines'),lineend='round')
We use kernel smoothing with a bandwidth of 4 and smoothing splines with spar = 0.6 (R's smoothing parameter, a monotone rescaling of \(\lambda\)). Both smoothers track the long-term warming trend while suppressing the year-to-year fluctuations.
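What `ksmooth` computes is a Nadaraya–Watson weighted average; per its documentation, R scales the kernel so its quartiles sit at \(\pm 0.25\,\)bandwidth, i.e. a Gaussian sd of \(0.25\,\mathrm{bandwidth}/\Phi^{-1}(0.75)\). A standalone sketch reproducing one smoothed value (the `nw` helper is illustrative, not from the original):

```r
# Hand-rolled Nadaraya-Watson estimate matching ksmooth's "normal" kernel.
nw <- function(x0, x, y, bandwidth) {
  w <- dnorm((x - x0) / (0.25 * bandwidth / qnorm(0.75)))  # ksmooth's quartile scaling
  sum(w * y) / sum(w)
}
set.seed(9)
x <- sort(runif(60)); y <- sin(2 * pi * x) + rnorm(60, sd = 0.1)
ks <- ksmooth(x, y, kernel = "normal", bandwidth = 0.2, x.points = x)
c(nw(x[30], x, y, 0.2), ks$y[30])  # nearly identical values
```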